Protein design is the rational design of new protein molecules with novel activity, behavior, or purpose, and a means of advancing basic understanding of protein function. Proteins can be designed from scratch (de novo design) or by making calculated variants of a known protein structure and its sequence (termed protein redesign). Rational protein design approaches predict protein sequences that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.
Rational protein design dates back to the mid-1970s. More recently, however, there have been numerous examples of successful rational design of water-soluble and even transmembrane peptides and proteins, in part due to a better understanding of the factors contributing to protein folding and the development of better computational methods.
When the first proteins were rationally designed during the 1970s and 1980s, their sequences were optimized manually based on analyses of other known proteins, the sequence composition, amino acid charges, and the geometry of the desired structure. The first designed proteins are attributed to Bernd Gutte, who designed a reduced version of a known catalyst, bovine ribonuclease, and tertiary structures consisting of beta-sheets and alpha-helices, including a binder of DDT. Urry and colleagues later designed elastin-like fibrous peptides based on rules on sequence composition. Richardson and coworkers designed a 79-residue protein with no sequence homology to a known protein. In the 1990s, the advent of powerful computers, libraries of amino acid conformations, and force fields developed mainly for molecular dynamics simulations enabled the development of structure-based computational protein design tools. Following the development of these computational tools, great success has been achieved over the last 30 years in protein design. The first successful completely de novo design of a protein was achieved by Stephen Mayo and coworkers in 1997, and shortly after, in 1999, Peter S. Kim and coworkers designed dimers, trimers, and tetramers of unnatural right-handed coiled coils. In 2003, David Baker's laboratory designed a full protein to a fold never seen before in nature. Later, in 2008, Baker's group computationally designed enzymes for two different reactions. In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe. In 2024, Baker received one half of the Nobel Prize in Chemistry for his advancement of computational protein design, with the other half being shared by Demis Hassabis and John Jumper of Google DeepMind for protein structure prediction.
Due to these and other successes (e.g., see examples below), protein design has become one of the most important tools available for protein engineering. There is great hope that the design of new proteins, small and large, will have uses in biomedicine and bioengineering.
Most often, the target structure is based on a known structure of another protein. However, novel folds not seen in nature have become increasingly feasible. Peter S. Kim and coworkers designed trimers and tetramers of unnatural coiled coils, which had not been seen before in nature. The protein Top7, developed in David Baker's lab, was designed entirely with protein design algorithms, to a completely novel fold. More recently, Baker and coworkers developed a series of principles to design ideal globular protein structures based on protein folding funnels that bridge between secondary structure prediction and tertiary structures. These principles, which build on both protein structure prediction and protein design, were used to design five different novel protein topologies.
Both de novo designs and protein redesigns can establish rules on the sequence space: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the RSC3 probe, used to select HIV broadly neutralizing antibodies, was restricted based on evolutionary data and charge balancing. Many of the earliest attempts at protein design relied heavily on empirical rules on the sequence space. Moreover, the design of fibrous proteins usually follows strict rules on the sequence space. Collagen-based designed proteins, for example, are often composed of Gly-Pro-X repeating patterns. The advent of computational techniques allows proteins to be designed with no human intervention in sequence selection.
Thus, an essential parameter of any design process is the amount of flexibility allowed for both the side-chains and the backbone. In the simplest models, the protein backbone is kept rigid while some of the protein side-chains are allowed to change conformations. However, side-chains can have many degrees of freedom in their bond lengths, bond angles, and χ dihedral angles. To simplify this space, protein design methods use rotamer libraries that assume ideal values for bond lengths and bond angles, while restricting χ dihedral angles to a few frequently observed low-energy conformations termed rotamers.
Rotamer libraries are derived from the statistical analysis of many protein structures. Backbone-independent rotamer libraries describe all rotamers. Backbone-dependent rotamer libraries, in contrast, describe the rotamers as how likely they are to appear depending on the protein backbone arrangement around the side chain. Most protein design programs use one conformation (e.g., the modal value for rotamer dihedrals in space) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region.
Although rational protein design must preserve the general backbone fold of a protein, allowing some backbone flexibility can significantly increase the number of sequences that fold to the structure while maintaining the general fold of the protein. Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Models of backbone flexibility in protein design include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.
The most accurate energy functions are those based on quantum mechanical simulations. However, such simulations are too slow and typically impractical for protein design. Instead, many protein design algorithms use either physics-based energy functions adapted from molecular mechanics simulation programs, knowledge-based energy functions, or a hybrid of both. The trend has been toward using more physics-based potential energy functions.
Physics-based energy functions, such as AMBER and CHARMM, are typically derived from quantum mechanical simulations and from experimental data from thermodynamics, crystallography, and spectroscopy. These energy functions typically simplify the physical energy function and make it pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair, which makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive Lennard-Jones term between atoms and a pairwise electrostatic Coulombic term between non-bonded atoms.
Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of accounting implicitly for complex effects, and of being less sensitive to small changes in the protein structure. These energy functions derive energy values from the frequency with which structural features appear in a structural database.
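A minimal sketch of such a pairwise-decomposable energy, assuming a 12-6 Lennard-Jones form, a Coulomb term with a constant dielectric, and Lorentz-Berthelot combining rules. The function names, the atom tuple layout, and the unit-conversion constant are illustrative assumptions and are not taken from AMBER or CHARMM.

```python
import math

# Approximate conversion factor so that charges in units of e and distances
# in angstroms give kcal/mol (assumed value; force fields differ slightly).
COULOMB_CONSTANT = 332.06

def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy: 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy between two point charges at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

def pairwise_energy(atoms):
    """Sum Lennard-Jones and Coulomb terms over all unique atom pairs.

    atoms: list of (x, y, z, charge, epsilon, sigma) tuples (hypothetical
    layout). Bonded-pair exclusions used by real force fields are omitted.
    """
    total = 0.0
    for a in range(len(atoms)):
        xa, ya, za, qa, ea, sa = atoms[a]
        for b in range(a + 1, len(atoms)):
            xb, yb, zb, qb, eb, sb = atoms[b]
            r = math.dist((xa, ya, za), (xb, yb, zb))
            eps = math.sqrt(ea * eb)   # geometric mean of well depths
            sig = 0.5 * (sa + sb)      # arithmetic mean of radii
            total += lennard_jones(r, eps, sig) + coulomb(r, qa, qb)
    return total
```

Because the total is a plain sum over pairs, the contribution of any two groups of atoms (for example two rotamers) can be precomputed once and reused by a search algorithm.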
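One common way to turn database frequencies into energies is the inverse Boltzmann relation, E = -kT ln(p_observed / p_reference). The sketch below assumes that relation, a hypothetical dictionary of counts per structural bin, and kT of about 0.593 kcal/mol (room temperature); it is an illustration, not the scheme of any particular design program.

```python
import math

def statistical_potential(observed_counts, reference_counts, kT=0.593):
    """Convert observed vs. reference counts into pseudo-energies.

    Both arguments map a bin label (e.g. a distance range) to a positive
    count. Bins seen more often than the reference get negative
    (favorable) energies.
    """
    total_obs = sum(observed_counts.values())
    total_ref = sum(reference_counts.values())
    energies = {}
    for label, n_obs in observed_counts.items():
        p_obs = n_obs / total_obs
        p_ref = reference_counts[label] / total_ref
        energies[label] = -kT * math.log(p_obs / p_ref)
    return energies

# Toy counts (made up for illustration only).
example = statistical_potential({"near": 30, "far": 70},
                                {"near": 50, "far": 50})
```

In the toy data the "far" bin is over-represented relative to the reference and therefore receives a negative energy, while "near" receives a positive one.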
Protein design, however, has requirements that molecular mechanics force fields do not always meet. These force fields, which have been used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, whereas protein design searches through many conformations of many sequences. Thus, molecular mechanics force fields must be tailored for protein design. In practice, protein design energy functions often incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most-used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function, and statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories, and specifically tailored for every design.
Individual water molecules can sometimes have a crucial structural role in the core of proteins, and in protein–protein or protein–ligand interactions. Failing to model such waters can result in mispredictions of the optimal sequence of a protein–protein interface. As an alternative, water molecules can be added to rotamers.
The number of candidate protein sequences, however, grows exponentially with the number of protein residues; for example, there are $20^{100}$ protein sequences of length 100. Furthermore, even if amino acid side-chain conformations are limited to a few rotamers (see Structural flexibility), this results in an exponential number of conformations for each sequence. Thus, in our 100-residue protein, and assuming that each amino acid has exactly 10 rotamers, a search algorithm that searches this space will have to search over $200^{100}$ protein conformations.
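The size of these numbers can be checked directly with arbitrary-precision integers. The constant names below are illustrative.

```python
AMINO_ACIDS = 20        # standard amino acid types
ROTAMERS_PER_AA = 10    # assumed rotamers per amino acid, as in the text
LENGTH = 100            # residues in the example protein

n_sequences = AMINO_ACIDS ** LENGTH
n_conformations = (AMINO_ACIDS * ROTAMERS_PER_AA) ** LENGTH

# Number of decimal digits in each count.
digits_sequences = len(str(n_sequences))
digits_conformations = len(str(n_conformations))
```

The sequence count has 131 decimal digits and the conformation count has 231, which is why exhaustive enumeration is not an option.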
The most common energy functions can be decomposed into pairwise terms between rotamers and amino acid types, which casts the problem as a combinatorial one that powerful optimization algorithms can solve. In those cases, the total energy of each conformation belonging to each sequence can be formulated as a sum of individual and pairwise terms between residue positions. If a designer is interested only in the best sequence, the protein design algorithm only requires the lowest-energy conformation of the lowest-energy sequence. In these cases, the amino acid identity of each rotamer can be ignored and all rotamers belonging to different amino acids can be treated the same. Let $r_i$ be a rotamer at residue position $i$ in the protein chain, and $E(r_i)$ the potential energy between the internal atoms of the rotamer. Let $E(r_i, r_j)$ be the potential energy between $r_i$ and rotamer $r_j$ at residue position $j$. Then, we define the optimization problem as one of finding the conformation of minimum energy ($E_T$):
$$\min E_T = \sum_i \Big[ E(r_i) + \sum_{j > i} E(r_i, r_j) \Big]$$
The problem of minimizing $E_T$ is NP-hard. Even so, in practice many instances of protein design can be solved exactly or optimized satisfactorily through heuristic methods.
Some protein design algorithms are listed below. These algorithms address only the most basic formulation of the protein design problem, the minimum-energy optimization defined above. Designers often change the optimization goal by extending the protein design model, for example by allowing more structural flexibility (e.g., protein backbone flexibility) or by including sophisticated energy terms, and many of these extensions are built atop the same algorithms. For example, Rosetta Design incorporates sophisticated energy terms and backbone flexibility, using Monte Carlo as the underlying optimization algorithm. OSPREY's algorithms build on the dead-end elimination algorithm and A* to incorporate continuous backbone and side-chain movements. Thus, these algorithms provide a good perspective on the different kinds of algorithms available for protein design.
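As a rough illustration of this formulation, the sketch below evaluates the total energy of a conformation from precomputed self and pairwise rotamer energies and finds the minimum by exhaustive enumeration. The table layout (`self_e[i][r]`, `pair_e[(i, j)][r][s]` with i < j, each pair counted once) and the toy numbers are assumptions made for the example; enumeration is only feasible for tiny instances.

```python
from itertools import product

def total_energy(conf, self_e, pair_e):
    """Sum of self energies plus pairwise energies, each pair counted once."""
    n = len(conf)
    energy = sum(self_e[i][conf[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][conf[i]][conf[j]]
    return energy

def brute_force_minimum(self_e, pair_e):
    """Enumerate every rotamer assignment and return the lowest-energy one."""
    choices = [range(len(position)) for position in self_e]
    best = min(product(*choices),
               key=lambda conf: total_energy(conf, self_e, pair_e))
    return best, total_energy(best, self_e, pair_e)

# Toy instance: 3 residue positions, 2 rotamers each (made-up energies).
self_e = [[0.0, 1.0], [0.5, 0.0], [0.2, 0.3]]
pair_e = {
    (0, 1): [[0.0, 2.0], [1.0, -1.0]],
    (0, 2): [[0.0, 0.0], [0.0, -0.5]],
    (1, 2): [[0.3, 0.0], [0.0, 0.1]],
}
best_conf, best_energy = brute_force_minimum(self_e, pair_e)
```

In this toy instance the minimum is rotamer 1 at every position: its self energies are not the lowest, but its favorable pairwise terms outweigh them, which is why positions cannot be optimized independently.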
In 2020, scientists reported the development of an AI-based process that uses genome databases for evolution-based design of novel proteins. They used deep learning to identify design rules. In 2022, a study reported deep learning software that can design proteins that contain pre-specified functional sites.
Other powerful extensions to the dead-end elimination algorithm include the pairs elimination criterion, and the generalized dead-end elimination criterion. This algorithm has also been extended to handle continuous rotamers with provable guarantees.
Although the dead-end elimination algorithm runs in polynomial time on each iteration, it cannot guarantee convergence. If, after a certain number of iterations, the dead-end elimination algorithm does not prune any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, dead-end elimination acts as a pre-filtering algorithm to reduce the search space, while other algorithms, such as A*, Monte Carlo, linear programming, or FASTER, are used to search the remaining search space.
A popular search algorithm for protein design is the A* search algorithm. A* computes a lower-bound score on each partial tree path that lower bounds (with guarantees) the energy of each of the expanded rotamers. Each partial conformation is added to a priority queue, and at each iteration the partial path with the lowest lower bound is popped from the queue and expanded. The algorithm stops once a full conformation has been enumerated and guarantees that this conformation is optimal.
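The pruning loop can be sketched as follows. The pruning criterion is not spelled out in this section, so the sketch assumes the original dead-end elimination criterion: a rotamer is removed when its best-case energy is still worse than the worst-case energy of some competing rotamer at the same position. The energy-table layout and helper names are illustrative assumptions.

```python
def pair(pair_e, i, r, j, s):
    """Pairwise energy lookup; tables are stored once under the key (low, high)."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

def dee_prune(self_e, pair_e):
    """Repeat single-rotamer elimination until no rotamer can be pruned.

    Returns one set of surviving rotamer indices per residue position.
    """
    n = len(self_e)
    alive = [set(range(len(position))) for position in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in list(alive[i]):
                # Best case for r: most favorable partner at every position.
                best_r = self_e[i][r] + sum(
                    min(pair(pair_e, i, r, j, s) for s in alive[j])
                    for j in others)
                for t in alive[i] - {r}:
                    # Worst case for t: least favorable partner everywhere.
                    worst_t = self_e[i][t] + sum(
                        max(pair(pair_e, i, t, j, s) for s in alive[j])
                        for j in others)
                    if best_r > worst_t:
                        alive[i].discard(r)   # r cannot be in the optimum
                        changed = True
                        break
    return alive

# Toy instance with 2 positions and 2 rotamers each (made-up energies).
toy_self = [[0.0, 5.0], [0.0, 0.0]]
toy_pair = {(0, 1): [[0.0, 1.0], [0.5, 0.0]]}
survivors = dee_prune(toy_self, toy_pair)
```

Here pruning rotamer 1 at position 0 tightens the bounds at position 1, which then loses a rotamer as well. When the loop stops with several survivors per position, the remaining space still has to be searched by another algorithm.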
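The A* score $f$ in protein design consists of two parts, $f = g + h$. $g$ is the exact energy of the rotamers that have already been assigned in the partial conformation. $h$ is a lower bound on the energy of the rotamers that have not yet been assigned. Each is defined as follows, where $d$ is the index of the last assigned residue in the partial conformation and $n$ is the number of residue positions:
$$g = \sum_{i=1}^{d} \Big[ E(r_i) + \sum_{j=i+1}^{d} E(r_i, r_j) \Big]$$
$$h = \sum_{j=d+1}^{n} \min_{r_j} \Big[ E(r_j) + \sum_{i=1}^{d} E(r_i, r_j) + \sum_{k=j+1}^{n} \min_{r_k} E(r_j, r_k) \Big]$$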
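A compact sketch of this search, assuming precomputed self and pairwise rotamer energy tables (`self_e[i][r]` and `pair_e[(i, j)][r][s]` with i < j, a hypothetical layout) and residues assigned in index order. Here g is the exact energy of the assigned prefix, and h adds, for each unassigned position, the cheapest rotamer given the prefix plus the most favorable interactions with later positions.

```python
import heapq

def pair(pair_e, i, r, j, s):
    """Pairwise energy lookup; tables are stored once under the key (low, high)."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

def astar_minimum(self_e, pair_e):
    n = len(self_e)

    def g(partial):
        d = len(partial)
        energy = sum(self_e[i][partial[i]] for i in range(d))
        energy += sum(pair(pair_e, i, partial[i], j, partial[j])
                      for i in range(d) for j in range(i + 1, d))
        return energy

    def h(partial):
        d = len(partial)
        bound = 0.0
        for j in range(d, n):
            bound += min(
                self_e[j][s]
                + sum(pair(pair_e, i, partial[i], j, s) for i in range(d))
                + sum(min(pair(pair_e, j, s, k, u)
                          for u in range(len(self_e[k])))
                      for k in range(j + 1, n))
                for s in range(len(self_e[j])))
        return bound

    queue = [(h(()), ())]                 # (f score, partial conformation)
    while queue:
        f, partial = heapq.heappop(queue)
        if len(partial) == n:             # first full conformation is optimal
            return partial, f
        position = len(partial)
        for s in range(len(self_e[position])):
            child = partial + (s,)
            heapq.heappush(queue, (g(child) + h(child), child))

# Toy instance: 3 residue positions, 2 rotamers each (made-up energies).
self_e = [[0.0, 1.0], [0.5, 0.0], [0.2, 0.3]]
pair_e = {
    (0, 1): [[0.0, 2.0], [1.0, -1.0]],
    (0, 2): [[0.0, 0.0], [0.0, -0.5]],
    (1, 2): [[0.3, 0.0], [0.0, 0.1]],
}
astar_conf, astar_energy = astar_minimum(self_e, pair_e)
```

Because h never overestimates the remaining energy, the first complete conformation popped from the queue has the minimum total energy; for a complete conformation h is zero, so the returned score is the exact energy.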
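In the integer linear programming (ILP) formulation, a binary variable $q_i(r_i)$ indicates whether rotamer $r_i$ is chosen at position $i$, and a binary variable $q_{ij}(r_i, r_j)$ indicates whether rotamers $r_i$ and $r_j$ are chosen together:
$$\min \sum_i \sum_{r_i} E(r_i)\, q_i(r_i) + \sum_i \sum_{j > i} \sum_{r_i} \sum_{r_j} E(r_i, r_j)\, q_{ij}(r_i, r_j)$$
s.t.
$$\sum_{r_i} q_i(r_i) = 1 \quad \forall i$$
$$\sum_{r_j} q_{ij}(r_i, r_j) = q_i(r_i) \quad \forall i,\ j \neq i,\ r_i$$
$$q_i(r_i),\ q_{ij}(r_i, r_j) \in \{0, 1\}$$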
ILP solvers, such as CPLEX, can compute the exact optimal solution for large instances of protein design problems. These solvers use a linear programming relaxation of the problem, where $q_i$ and $q_{ij}$ are allowed to take continuous values, in combination with a branch and cut algorithm to search only a small portion of the conformation space for the optimal solution. ILP solvers have been shown to solve many instances of the side-chain placement problem.
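In Monte Carlo search with the Metropolis criterion, a move from a conformation with energy $E_{old}$ to one with a higher energy $E_{new}$ is accepted with probability
$$p = e^{-\beta (E_{new} - E_{old})}$$
where $\beta = 1/(k_B T)$, $k_B$ is the Boltzmann constant, and the temperature $T$ can be chosen such that it is high in the initial rounds and is slowly annealed, which helps the search escape local minima.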
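A minimal simulated annealing sketch over rotamer assignments, assuming the Metropolis acceptance rule, a geometric cooling schedule, and temperatures expressed in energy units so that β = 1/T. The function names, the schedule, and the default parameters are illustrative choices, not those of any particular design package.

```python
import math
import random

def metropolis_accept(e_old, e_new, beta, rng):
    """Always accept downhill moves; accept uphill moves with exp(-beta*dE)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-beta * (e_new - e_old))

def simulated_annealing(energy, n_rotamers, steps=5000,
                        t_start=10.0, t_end=0.01, seed=0):
    """Heuristic search; returns the best conformation seen and its energy.

    energy: callable taking a list of rotamer indices.
    n_rotamers: number of rotamers available at each position.
    """
    rng = random.Random(seed)
    conf = [rng.randrange(k) for k in n_rotamers]
    current = energy(conf)
    best, best_energy = list(conf), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        beta = 1.0 / t
        i = rng.randrange(len(conf))
        previous = conf[i]
        conf[i] = rng.randrange(n_rotamers[i])   # propose a rotamer change
        proposed = energy(conf)
        if metropolis_accept(current, proposed, beta, rng):
            current = proposed
            if current < best_energy:
                best, best_energy = list(conf), current
        else:
            conf[i] = previous                   # reject: undo the move
    return best, best_energy
```

Unlike dead-end elimination or A*, this search carries no optimality guarantee; running it from several seeds and keeping the lowest-energy result is a common precaution.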
Both max-product and sum-product belief propagation have been used to optimize protein design.
Great progress in de novo enzyme design, and redesign, was made in the first decade of the 21st century. In three major studies, David Baker and coworkers designed de novo enzymes for the retro-aldol reaction, a Kemp-elimination reaction, and the Diels-Alder reaction. Furthermore, Stephen Mayo and coworkers developed an iterative method to design the most efficient known enzyme for the Kemp-elimination reaction. Also, in the laboratory of Bruce Donald, computational protein design was used to switch the specificity of one of the domains of the nonribosomal peptide synthetase that produces gramicidin S, from its natural substrate phenylalanine to other noncognate substrates including charged amino acids; the redesigned enzymes had activities close to those of the wild-type.
The methodology of semi-rational design emphasizes the in-depth understanding of enzymes and the control of the evolutionary process. It allows researchers to use known information to guide the evolutionary process, thereby improving efficiency and success rate. This method plays an important role in protein function modification because it can combine the advantages of irrational design and rational design, and can explore unknown space and use known knowledge for targeted modification.
Semi-rational design has a wide range of applications, including enzyme optimization, modification of drug targets, and evolution of biocatalysts. Through this method, researchers can more effectively improve the functional properties of proteins to meet specific biotechnology or medical needs. Although the method places high demands on information and technology and is relatively difficult to implement, advances in computing technology and bioinformatics continue to broaden its application prospects in protein engineering.
Protein–protein interactions can be designed using protein design algorithms because the principles that rule protein stability also rule protein–protein binding. Protein–protein interaction design, however, presents challenges not commonly present in protein design. One of the most important challenges is that, in general, the interfaces between proteins are more polar than protein cores, and binding involves a tradeoff between desolvation and hydrogen bond formation. To overcome this challenge, Bruce Tidor and coworkers developed a method to improve the affinity of antibodies by focusing on electrostatic contributions. They found that, for the antibodies designed in the study, reducing the desolvation costs of the residues in the interface increased the affinity of the binding pair.
The K* algorithm approximates the binding constant of a protein–ligand complex by including conformational entropy in the free energy calculation. The K* algorithm considers only the lowest-energy conformations of the free protein, the free ligand, and the bound complex (denoted by the sets P, L, and PL) to approximate the partition function of each state:
$$K^* = \frac{\sum_{x \in PL} e^{-E(x)/RT}}{\sum_{x \in P} e^{-E(x)/RT} \sum_{x \in L} e^{-E(x)/RT}}$$
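A small sketch of this ratio of partition functions, assuming each state is represented by a plain list of low-energy conformation energies in kcal/mol. The function names and the default temperature are illustrative; this is not the OSPREY implementation, which also bounds the error introduced by the conformations it leaves out.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)

def partition_function(energies, temperature=298.15):
    """Boltzmann-weighted sum over the supplied conformation energies."""
    rt = GAS_CONSTANT * temperature
    return sum(math.exp(-e / rt) for e in energies)

def kstar_score(complex_energies, protein_energies, ligand_energies,
                temperature=298.15):
    """Ratio of the bound partition function to the product of the free ones."""
    q_complex = partition_function(complex_energies, temperature)
    q_protein = partition_function(protein_energies, temperature)
    q_ligand = partition_function(ligand_energies, temperature)
    return q_complex / (q_protein * q_ligand)
```

A larger score corresponds to tighter predicted binding. Because each partition function sums over an ensemble, a complex with many accessible low-energy conformations scores higher than one with a single conformation of the same energy.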
In a sense, protein design is a subset of circuit design.